Sequence analyses of Malaysian Indigenous communities reveal historical admixture between Hoabinhian hunter-gatherers and Neolithic farmers

Southeast Asia comprises 11 countries that span mainland Asia across to numerous islands that stretch from the Andaman Sea to the South China Sea and Indian Ocean. This region harbors an impressive diversity of history, culture, religion and biology. Indigenous people of Malaysia display substantial phenotypic, linguistic, and anthropological diversity. Despite this remarkable diversity which has been documented for centuries, the genetic history and structure of indigenous Malaysians remain under-studied. To have a better understanding about the genetic history of these people, especially Malaysian Negritos, we sequenced whole genomes of 15 individuals belonging to five indigenous groups from Peninsular Malaysia and one from North Borneo to high coverage (30X). Our results demonstrate that indigenous populations of Malaysia are genetically close to East Asian populations. We show that present-day Malaysian Negritos can be modeled as an admixture of ancient Hoabinhian hunter-gatherers and Neolithic farmers. We observe gene flow from South Asian populations into the Malaysian indigenous groups, but not into Dusun of North Borneo. Our study proposes that Malaysian indigenous people originated from at least three distinct ancestral populations related to the Hoabinhian hunter-gatherers, Neolithic farmers and Austronesian speakers.


Results
Population structure. To elucidate the genetic history of indigenous people of Malaysia, we sequenced 11 individuals belong to 4 Orang Asli tribes ( Fig. S1 and Table S1) at around 30 × coverage using with Illumina HiSeq 2000 platform and included 4 whole genome sequence from Dusun and Mendriq which we published earlier 24 . We used BWA v0.7.12 software to align the sequences to GRCh38 and GATK v3.5.0 for the variant calling. This dataset was merged with Human Genome Diversity Project (HGDP)-CEPH panel data 63 , Andaman Islanders 64 , Malay individuals from Singapore Genome Diversity Project (SSM) 65 . We also constructed a dataset using OAs and ancient AMH samples from southeast Asia (Table S2) to explore the historic link between these groups. We performed Principal Component Analysis (PCA) in order to understand the genetic structure of OAs and their relationship with the surrounding populations. PCA comparing indigenous populations of Malaysia with worldwide populations from the HGDP-CEPH dataset revealed that the indigenous Malaysians are genetically close to East Asian populations (Figs. S2 and S3). This suggests shared ancestors with EA or considerable gene flow between the two groups. On a finer scale, using populations from East, South, and Southeast Asia, OAs especially the Malaysian Negritos, exhibit an affinity towards the South Asians (SA) and Andamanese groups, while Dusun from North Borneo cluster closer to the East Asians (Fig. 1B). This implies a possible admixture between OAs and SA. To explore the relation of Malaysian Negritos with Hoabinhian hunter-gatherers and southeast Asian early farmers we carried out a PCA using ancient SEA samples. The ancient SEA dataset we used in this study includes two Hoabinhian individuals (La368 and Ma911) as well as several Neolithic farmers discovered in archeological sites across Malaysia, Vietnam, Laos and Thailand. PCA with ancient SEAs shows that the ancient samples belonging to the Hoabinhian culture cluster adjacent to the modern-day Andamanese (Figs. 1C and S4). Malaysian Negritos positioned intermediate between the Andamanese/Hoabinhian and EA clusters while the rest of OAs were closer to Neolithic SEA. We conducted ADMIXTURE analysis to infer the genetic ancestry of OAs. In ADMIXTURE analysis of OAs, South, Southeast, and EA populations, the crossvalidation score (Fig. S5) proposed that a model with five ancestral components (K = 5) was the best. At K = 5, Seletar (sea nomads) appeared to have a distinct (light blue) ancestral component (

Estimation of effective population sizes and divergence times.
To estimate the effective population size (Ne) and divergence time in OAs, we carried out a Multiple Sequentially Markovian Coalescent (MSMC2) analysis. In order to have a better resolution, we first included four randomly selected individuals (8 haplotypes) from each population in the MSMC2 analysis. This limited us to only Jehai and Seletar tribes which have sufficient sample size (Figs. 2a and S10). In later stage we tried to include all the tribes in the analysis by recruiting two randomly selected individuals (4 haplotypes) from each population or only 1 individual (2 haplotypes) for Mah Meri and Jakun tribe (Figs. 2b and S9). In general, OAs retained a lower Ne after around 30 kya than neighboring populations. These results could be further supported by the runs of homozygosity (ROH) analysis which revealed long stretches of ROH in OAs (Fig. S11). We found an increase in Ne in Dusun around 6 kya, which possibly coincides with the Austronesian expansion.
For the divergence time, we found that the split between Malaysian Negritos and EA took place around 14-13 kya (Fig. 2b) which is consistent with the results of our previous study using genotyping data 18 . Seletar and Dusun diverged from Han around 10 kya, which is in good agreement with the initial divergence of Austronesian from EA 44 . Overall, the divergences between different Malaysian groups were relatively recent. The divergence between Malaysian Negritos and Austronesians occurred around 12 kya, followed by a split from MahMeri around 9 kya. Jehai and Mendriq (two Negrito tribes) separated from each other approximately 2.6 kya.

Gene flow between indigenous Malaysian and neighboring populations.
To investigate potential gene flow in the history of indigenous Malaysian and modern and ancient populations, we performed a TreeMix analysis (Figs. 3 and S12). For modern populations, TreeMix suggested five migration events. The tree topology revealed that Malaysian Negritos formed a separate cluster while the other Malaysian indigenous groups clustered with EA populations. We identified gene flow between Andamanese and Malaysian Negritos. Our analysis also demonstrated gene flow between Dusun and Melanesian Bougainville. This may reflect the admixture between a population genetically close to today's Dusun in Borneo and a population with Papuan ancestry, attributed to the Austronesian expansion, which has been described by previous studies [45][46][47] . We also detected gene flow within OA groups, notably from Jehai (Negrito) to Jakun (Proto-Malay), and from MahMeri (Senoi) to Mendriq (Negrito).
TreeMix analysis of Malaysian indigenous and ancient SEA samples revealed similar topology. The Hoabinhian samples clustered separately and next to the Andamanese-Papuan clade, whereas Neolithic SEA clustered with modern EA. Interestingly, our analysis revealed gene flow from two ancient samples from Malaysia, namely Ma911 (Hoabinhian hunter-gatherer) and Ma912 (Neolithic farmer), to the Malaysian Negritos.
To better explore the existence of gene flow between OAs and the neighboring populations, within different OA groups and to further confirm the link between modern-day Negritos, Hoabinhian hunter-gatherer and Neolithic farmers, we conducted f4 tests (Table S4). To ascertain links between Malaysian Negritos and Andamanese, we calculated f4(Mbuti, Onge/Jarawa; X, Han), where X denotes the test population. We detected a significant f4-score when setting Jehai as X (Z = − 3.669 and − 3.921 for Onge and Jarawa, respectively); however, f4-scores for Mendriq were not significant. Computing f4(Mbuti, Oceanians; X, Han) displayed a significant f4-score between Dusun and Bougainville (Z = − 3.23). Testing the f4 between different OA groups, we found gene flow between the Malaysian Negritos and their neighboring Jakun (Proto-Malay) and MahMeri (Senoi) groups. However, there was no evidence of gene flow between Malaysian Negritos and Dusun of North Borneo. f4 estimates for ancient Malaysian samples (Ma911 and Ma912) and different OA groups demonstrated significant f4-score only between Ma911 and Malaysian Negrito, while Ma912 had significant f4 with both Malaysian Negritos and Senoi. We computed outgroup-f3 to measure the amount of shared drift between ancient Malaysians and OAs. Interestingly, we noticed that the Hoabinhian Ma911 share the most drift with Malaysian Negritos, while Neolithic farmer Ma912 shared the most drift with Senoi MahMeri (Fig. 4).    www.nature.com/scientificreports/ Analysis of OAs with ancient DNA from the Gua Cha revealed the contribution of populations genetically close to these samples into the Malaysian Negritos gene pool. The Gua Cha site is a rock shelter in northern Peninsular Malaysia. Based on Sieveking (1954), two archeological phases are recognizable at this site 54 . The Hoabinhian phase when the shelter was used for habitation and occasionally for burial, and the Neolithic phase when it functioned as a cemetery 55 . Radiocarbon dating showed that the Hoabinhian occupied the Gua Cha from 9 kya and later the Neolithic farmers used this site from 3 kya 56 . Our outgroup-f3 analysis is consistent with the archeological findings regarding the transition from hunting-gathering to farming lifestyle in the Gua Cha cave. While the Ma911 (Hoabinhian layer) shared most alleles with the Malaysian Negritos, the Ma912 (Neolithic farmer) was closer to the Senoi agriculturists. Our results confirm that modern Malaysian Negritos have been derived genetically from two ancient populations: the Hoabinhian hunter-gatherers and the Neolithic farmers who originated from South China or MSEA.
Our analysis detected gene flow between different OA tribes, notably between Malaysian Negritos, with MahMeri and Jakun tribes. The admixture between neighboring OA tribes or between OAs and the Malay population has been reported previously 18,57 . For example, Jinam et al. (2013) reported recent admixture between Jehai and their neighboring Malay, whereas such admixture was absent in Kensiu (another Negrito group). We did not find any traces of Negrito or Hoabinhian ancestry in Dusun. Likewise,  reported the absence of Negrito ancestry in North Borneo, Dayak, and Bidayuh populations. Considering the demographical and archeological evidence which supports the presence of Austro-Melanesian people on Borneo Island 58 , the best explanation for the absence of Negrito ancestry in Borneo could be the replacement of initial Austrolo-Melanesian inhabitants of the island by the Austronesians.
Interestingly, all the Seletar samples carried mtDNA N9a6b haplogroup. N96a haplogroup seems to be confined to the ISEA and reaches the highest frequency in Malaysia 32 . Our results are consistent with Jinam et al.
(2012) who reported only 4 mtDNA haplogroups (with N9a6b making up of 71% of mtDNA haplogroup frequency) in Seletar. Seletar are sea nomads who live along the strait of Johor (a waterway that separates Malaysia from Singapore). The history of Seletar is not well-documented. They are usually associated with the Orang Laut ("Sea people" in Malay), a conglomerate of sea nomad tribes who occupied the strait of Melaka 59 . Our TreeMix and ROH results indicate that the Seletar are genetically closer to the Austronesian speakers, but they experienced severe genetic drift.
In summary, our study suggests that at least 3 ancestral components were involved in shaping today's indigenous Malaysian populations, the Hoabinhian hunter-gatherers, Neolithic farmers, and Austronesian speakers. We also showed the genetic interaction between different Orang Asli tribes of Peninsular Malaysia.  (3)]. All methods were carried out in accordance to the principles of the Declaration of Helsinki. Before sample collection, we paid a courtesy visit to each tribe and obtained approval from the tribe's chieftain and district offices. We also received approval from the chairperson of the Committee for Village Development and Security for the North Bornean samples.

Samples. This study was reviewed and approved by the Monash University Human Research Ethics
For this study, we only recruited unrelated volunteer participants above 18 years old who provided written consent. We recruited 11 individuals including 5 Negrito (5 Jehai), 1 Senoi (MahMeri) and 5 Proto-Malay (4 Seletar and 1 Jakun). We collected peripheral blood (6 ml) from each participant and recorded their self-reported ethnicity and family pedigree (Fig. S1).
DNA extraction, sequencing and variant calling. We extracted the genomic DNA using a modified salting-out method 60 and the DNeasy Blood and Tissue kit (Qiagen) for the North Bornean samples. We performed sequencing with Illumina HiSeq 2000 at approximately 30 × sequencing coverage and paired-end read length of 100 bp. We also included 2 Negrito Mendriq and 2 North Bornean (Dusun) fastq files from our previous study 24 . We mapped paired-end reads to GRCh38 using the Burrows-Wheeler Aligner 61 (bwa mem) version 0.7.12. We removed PCR duplicate reads using the Picard MarkDuplicates tool version 1.93 (http:// broad insti tute. github. io/ picard/). We also performed post-alignment processing, for example, base quality recalibration or local indel realignment. To identify single nucleotide variants (SNVs) and small indels, we used GATK HaplotypeCaller 62 65 . To investigate the historic relationships between indigenous populations of Malaysia with other populations in Southeast Asia, we downloaded ancient SEA data 13,49 . We used the UCSC LiftOver tool to convert the genome coordinates of Andamanese and Malay and ancient SEA data from hg19 to GRCh38. We constructed two datasets. The first dataset comprised 15 indigenous Malaysians along with 1035 unrelated individuals from HGDP, SSM, and Andamanese dataset. After quality control for each population including missing rate per SNP < 0.05, minor allele frequency > 0.05, and Hardy-Weinberg exact (HWE) test (P < 10 −6 ), 3,374,375 shared SNVs remained. The second datasets included dataset 1 and 43 ancient SEA samples with 3,347,752 overlapping SNVs.
We performed principal component analysis (PCA) and ancestry estimation to assess the genetic structure of different indigenous groups within Malaysia and also the relationship between these groups and other www.nature.com/scientificreports/ neighboring populations. For this analysis, we filtered out SNVs with linkage disequilibrium (r 2 > 0.8) to eliminate the effects of excessive LD. After LD pruning, 812,971 and 806,229 SNVs remained from dataset 1 and dataset 2, respectively. In addition, we normalized the sample size by randomly selecting 10 individuals from each population in the HGDP and SSM datasets (Fig. S4). We used ADMIXTURE 66 analysis for estimating the ancestry and smartPCA from the EIGENSOFT package 67 for PCA analysis. For the PCA analysis of dataset2, we first computed the eigenvectors using modern samples and later projected ancient samples onto the first two PCs with "lsqproject" and "shrinkmode" parameters to account for excessive missing data in the ancient samples. We visualized ADMIXTURE results with pong software 68 (Figs. S5-S8). We conducted a Multiple Sequentially Markovian Coalescent (MSMC2) analysis 69 to estimate the effective population size (Ne) changes and divergence time of populations. We generated the VCF and masking files for 4 individuals per population, where applicable, according to the MSMC2 recommended parameters. For phasing the data, we used Eagle version 2.4.1 70 . We also assumed a mutation rate of 1.25 × 10 −8 per base per human generation and a generation time of 29 years 71 .
To further investigate the relationship between the populations and potential migration events, we inferred a maximum likelihood drift tree using TreeMix version 1.13 15 . We used blocks of 500 SNVs (-k 500) to account for LD and added 5 migration edges sequentially with 100 replications for each migration edge and Yoruba as root population. We examined gene flow between indigenous Malaysians and surrounding populations using ADMIXTOOLS package 72 .
We analyzed the full Y-chromosome and mitochondrial sequences by annotating them with the mutations commonly used for nomenclature. We used HaploGrep 73 and MitoSuite 74 to assign the mitochondrial haplogroups. For the Y-chromosome, we called the genotypes with SAMtools/BCFtools version 1.9 75 . We restricted calling to the 10.3 Mb region previously identified to be accessible for short-read sequencing 76 . We used yhaplo 77 to assign the Y-chromosome haplogroups. We used ggplot2 version 3.3.3 package (https:// ggplo t2. tidyv erse. org) in R version 4.0.4 (https:// www.R-proje ct. org/) for visualization 78,79 .

Data availability
The data that support the findings of this study are available through the European Variation Archive with accession number PRJEB48356.